Predicting the Closing Price of Google Stock

1. Problem

Predicting the closing price of a stock is a complex problem for several reasons. Stock prices are influenced by a multitude of factors, such as market trends, and analyzing and incorporating all of these factors accurately into a predictive model is a difficult task. Market volatility makes accurate prediction harder still, and the quality and quantity of available data impose further limits. The pursuit of solving this problem is nevertheless crucial, because accurate stock price predictions have significant implications for investors, financial institutions, and businesses: they can help investors make more informed decisions about buying, selling, or holding stocks, and they aid in managing risk.

2. Data Mining Task

In our project, we will use two data mining tasks to help us predict the closing price of a stock: classification and clustering. For classification, we will train a model to classify the closing price based on a set of attributes such as volume, open, high, and low. For clustering, we will partition the closing prices into subsets, or clusters, where prices within a cluster are similar to each other but dissimilar to prices in other clusters, based on the attributes low, high, open, volume, adjClose, and adjHigh.
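As a preview of the clustering task, here is a minimal k-means sketch over the attributes named above; the choice of kmeans and of k = 3 are our illustrative assumptions, not part of the project pipeline itself:

# illustrative k-means sketch; assumes Google.csv with the columns listed in Section 3
stock <- read.csv("Google.csv")
features <- scale(stock[, c("low", "high", "open", "volume", "adjClose", "adjHigh")])
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 25)  # k = 3 is an arbitrary choice
table(km$cluster)  # sizes of the resulting clusters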

3. Data

Our dataset is from the source: https://www.kaggle.com/datasets/shreenidhihipparagi/google-stock-prediction

Number of Attributes: 14

Number of objects: 1258

Attribute characteristics:

Attribute Name  Data Type  Description
symbol          nominal    Name of the company.
date            date       Trading date (day, month, and year).
close           numeric    The closing price: the final price at which the stock traded on a given trading day.
high            numeric    The highest price at which the stock traded during a specific trading day.
low             numeric    The lowest price at which the stock traded during a specific trading day.
open            numeric    The price of the stock at the beginning of a trading day, i.e., the price at which the first trade occurred that day.
volume          numeric    The total number of shares traded during a trading day; a measure of market activity and liquidity for the stock.
adjClose        numeric    The closing price adjusted for corporate actions such as dividends, stock splits, or other events that could affect the stock price.
adjHigh         numeric    The highest price during a trading day, adjusted for corporate actions.
adjLow          numeric    The lowest price during a trading day, adjusted for corporate actions.
adjOpen         numeric    The opening price at the beginning of a trading day, adjusted for corporate actions.
adjVolume       numeric    The trading volume adjusted for corporate actions; this can give a clearer picture of trading activity.
divCash         binary     The amount of cash paid by the company to shareholders as dividends, typically on a per-share basis (constant in this dataset).
splitFactor     binary     If a stock undergoes a split, the split factor indicates the ratio; for instance, a 2-for-1 split means every old share becomes 2 new shares (constant in this dataset).
# Load necessary packages
if (!require(caret)) {
  install.packages("caret")
}
Loading required package: caret
Loading required package: lattice
if (!require(cluster)) {
  install.packages("cluster")
}
if (!require(fpc)) {
  install.packages("fpc")
}
Loading required package: fpc
Warning: package ‘fpc’ was built under R version 4.3.2
if (!require(ggplot2)) {
  install.packages("ggplot2")
}
library(caret)
library(cluster)
library(fpc)
library(ggplot2)
dataset = read.csv('Google.csv') 
View(dataset)
print(dataset)

We removed the attributes symbol, divCash, and splitFactor because each of them contains only a single value, so they carry no useful information.

dataset = dataset[, 2:12]  # keep columns 2-12 (drops symbol, divCash, splitFactor)

Convert the date column to a date format

dataset$date <- as.Date(dataset$date, format = "%Y-%m-%d %H:%M:%S")
print(dataset)
str(dataset)
'data.frame':   1258 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : num  718 719 710 692 694 ...
 $ high     : num  722 723 717 709 702 ...
 $ low      : num  713 717 703 688 693 ...
 $ open     : num  716 719 715 709 699 ...
 $ volume   : int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
 $ adjClose : num  718 719 710 692 694 ...
 $ adjHigh  : num  722 723 717 709 702 ...
 $ adjLow   : num  713 717 703 688 693 ...
 $ adjOpen  : num  716 719 715 709 699 ...
 $ adjVolume: int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
summary(dataset)
      date                close             high       
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3  
 1st Qu.:2017-09-12   1st Qu.: 960.8   1st Qu.: 968.8  
 Median :2018-12-11   Median :1132.5   Median :1143.9  
 Mean   :2018-12-12   Mean   :1216.3   Mean   :1227.4  
 3rd Qu.:2020-03-12   3rd Qu.:1360.6   3rd Qu.:1374.3  
 Max.   :2021-06-11   Max.   :2521.6   Max.   :2527.0  
      low              open          volume           adjClose     
 Min.   : 663.3   Min.   : 671   Min.   : 346753   Min.   : 668.3  
 1st Qu.: 952.2   1st Qu.: 959   1st Qu.:1173522   1st Qu.: 960.8  
 Median :1117.9   Median :1131   Median :1412588   Median :1132.5  
 Mean   :1204.2   Mean   :1215   Mean   :1601590   Mean   :1216.3  
 3rd Qu.:1348.6   3rd Qu.:1361   3rd Qu.:1812156   3rd Qu.:1360.6  
 Max.   :2498.3   Max.   :2525   Max.   :6207027   Max.   :2521.6  
    adjHigh           adjLow          adjOpen       adjVolume      
 Min.   : 672.3   Min.   : 663.3   Min.   : 671   Min.   : 346753  
 1st Qu.: 968.8   1st Qu.: 952.2   1st Qu.: 959   1st Qu.:1173522  
 Median :1143.9   Median :1117.9   Median :1131   Median :1412588  
 Mean   :1227.4   Mean   :1204.2   Mean   :1215   Mean   :1601590  
 3rd Qu.:1374.3   3rd Qu.:1348.6   3rd Qu.:1361   3rd Qu.:1812156  
 Max.   :2527.0   Max.   :2498.3   Max.   :2525   Max.   :6207027  

Mean of the closing price. The mean closing price can serve as a basic reference point or a simple benchmark for forecasting future stock prices; it is the average price at which the stock closed over a specific period.

mean(dataset$close)
[1] 1216.317

Variance of the closing price

The variance of the closing prices quantifies the spread, or dispersion, of the closing prices around their mean. It measures how much the actual closing prices deviate from the average closing price over a specific period.

var(dataset$close)
[1] 146944.5

Summaries, outliers, and boxplots for all numeric attributes.

# statistical measures
# summaries
summary(dataset$close)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  668.3   960.8  1132.5  1216.3  1360.6  2521.6 
summary(dataset$high)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  672.3   968.8  1143.9  1227.4  1374.3  2527.0 
summary(dataset$low)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  663.3   952.2  1117.9  1204.2  1348.6  2498.3 
summary(dataset$open)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    671     959    1131    1215    1361    2525 
summary(dataset$volume)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 346753 1173522 1412588 1601590 1812156 6207027 
summary(dataset$adjClose)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  668.3   960.8  1132.5  1216.3  1360.6  2521.6 
summary(dataset$adjHigh)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  672.3   968.8  1143.9  1227.4  1374.3  2527.0 
summary(dataset$adjLow)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  663.3   952.2  1117.9  1204.2  1348.6  2498.3 
summary(dataset$adjOpen)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    671     959    1131    1215    1361    2525 
summary(dataset$adjVolume)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 346753 1173522 1412588 1601590 1812156 6207027 
#outliers
boxplot.stats(dataset$close)$out
 [1] 2070.07 2062.37 2098.00 2092.91 2083.51 2095.38 2095.89 2104.11
 [9] 2121.90 2128.31 2117.20 2101.14 2064.88 2070.86 2095.17 2031.36
[17] 2036.86 2081.51 2075.84 2026.71 2049.09 2108.54 2024.17 2052.70
[25] 2055.03 2114.77 2061.92 2066.49 2092.52 2091.08 2036.22 2043.20
[33] 2038.59 2052.96 2045.06 2044.36 2035.55 2055.95 2055.54 2068.63
[41] 2137.75 2225.55 2224.75 2249.68 2265.44 2285.88 2254.79 2267.27
[49] 2254.84 2296.66 2297.76 2302.40 2293.63 2293.29 2267.92 2315.30
[57] 2326.74 2307.12 2379.91 2429.89 2410.12 2395.17 2354.25 2356.74
[65] 2381.35 2398.69 2341.66 2308.76 2239.08 2261.97 2316.16 2321.41
[73] 2303.43 2308.71 2356.09 2345.10 2406.67 2409.07 2433.53 2402.51
[81] 2411.56 2429.81 2421.28 2404.61 2451.76 2466.09 2482.85 2491.40
[89] 2521.60 2513.93
boxplot.stats(dataset$high)$out
 [1] 2116.500 2078.550 2102.510 2123.547 2105.130 2108.370 2102.030
 [8] 2108.820 2152.680 2133.660 2132.735 2130.530 2091.420 2082.010
[15] 2100.780 2094.880 2071.010 2086.520 2104.370 2088.518 2089.240
[22] 2118.110 2128.810 2078.040 2075.000 2125.700 2090.260 2067.060
[29] 2123.560 2109.780 2075.500 2053.100 2057.990 2072.302 2078.210
[36] 2058.870 2050.990 2058.430 2070.780 2093.327 2142.940 2237.310
[43] 2237.660 2255.000 2284.005 2289.040 2275.320 2277.210 2277.990
[50] 2306.597 2306.440 2318.450 2309.600 2295.320 2303.762 2325.820
[57] 2341.260 2337.450 2452.378 2436.520 2427.140 2419.700 2379.260
[64] 2382.200 2382.710 2416.410 2378.000 2322.000 2285.370 2276.601
[71] 2321.140 2323.340 2343.150 2316.760 2360.340 2369.000 2418.480
[78] 2432.890 2442.944 2440.000 2428.140 2437.971 2442.000 2409.745
[85] 2453.859 2468.000 2494.495 2505.000 2523.260 2526.990
boxplot.stats(dataset$low)$out
 [1] 2018.380 2042.590 2059.330 2072.000 2078.540 2063.090 2077.320
 [8] 2083.130 2104.360 2098.920 2103.710 2097.410 2062.140 2002.020
[15] 2038.130 2021.290 2016.060 2046.100 2071.260 2010.000 2020.270
[22] 2046.415 2021.610 2047.830 2033.370 2072.380 2047.550 2043.510
[29] 2070.000 2054.000 2033.550 2017.680 2026.070 2039.220 2041.555
[36] 2010.730 2014.020 2015.620 2044.030 2056.745 2096.890 2151.620
[43] 2214.800 2225.330 2257.680 2253.714 2238.465 2256.090 2249.190
[50] 2266.000 2284.450 2287.845 2271.710 2258.570 2256.450 2278.210
[57] 2313.840 2304.270 2374.850 2402.280 2402.160 2384.500 2311.700
[64] 2351.410 2342.338 2390.000 2334.730 2283.000 2230.050 2242.720
[71] 2283.320 2295.000 2303.160 2263.520 2321.090 2342.370 2360.110
[78] 2402.990 2412.515 2402.000 2407.690 2404.880 2404.200 2382.830
[85] 2417.770 2441.073 2468.240 2487.330 2494.000 2498.290
boxplot.stats(dataset$open)$out
 [1] 2073.000 2068.890 2070.000 2105.910 2078.540 2094.210 2099.510
 [8] 2090.250 2104.360 2100.000 2110.390 2119.270 2067.000 2025.010
[15] 2041.830 2067.450 2050.520 2056.520 2076.190 2067.210 2023.370
[22] 2073.120 2101.130 2070.000 2071.760 2074.060 2085.000 2062.300
[29] 2078.990 2076.030 2061.000 2042.050 2041.840 2051.700 2065.370
[36] 2044.810 2038.860 2027.880 2057.630 2059.120 2097.950 2152.940
[43] 2222.500 2226.130 2277.960 2256.700 2266.250 2261.470 2275.160
[50] 2276.980 2303.000 2291.980 2307.890 2285.250 2293.230 2283.470
[57] 2319.930 2336.000 2407.145 2410.330 2404.490 2402.720 2369.740
[64] 2368.420 2350.640 2400.000 2374.890 2291.860 2261.710 2261.090
[71] 2291.830 2309.320 2336.906 2264.400 2328.040 2365.990 2367.000
[78] 2420.000 2412.835 2436.940 2421.960 2422.000 2435.310 2395.020
[85] 2422.520 2451.320 2479.900 2499.500 2494.010 2524.920
boxplot.stats(dataset$volume)$out
 [1] 3402357 4449022 3530169 3841482 4269902 4745183 3654385 3017947
 [9] 2973891 2965771 3246573 3487056 3160585 3270248 3731589 2921393
[17] 3248393 4626086 3095263 5125791 3142760 4758496 3336352 3360727
[25] 3267883 3029471 3369275 4760260 3088305 3318204 4405584 2950120
[33] 4187586 3880723 3212657 4595891 3552194 6207027 5130576 2833483
[41] 4805752 3316905 3055216 3932954 2867053 2978300 3790618 3365365
[49] 4226748 3700125 4252365 3861489 4233435 3651106 3601750 4044137
[57] 3344450 4081528 3573755 3208495 2951309 3793630 3157875 4267698
[65] 3429036 3581072 3107763 3103882 2888827 4330862 3570927 4016353
[73] 4118170 2986439
boxplot.stats(dataset$adjClose)$out
 [1] 2070.07 2062.37 2098.00 2092.91 2083.51 2095.38 2095.89 2104.11
 [9] 2121.90 2128.31 2117.20 2101.14 2064.88 2070.86 2095.17 2031.36
[17] 2036.86 2081.51 2075.84 2026.71 2049.09 2108.54 2024.17 2052.70
[25] 2055.03 2114.77 2061.92 2066.49 2092.52 2091.08 2036.22 2043.20
[33] 2038.59 2052.96 2045.06 2044.36 2035.55 2055.95 2055.54 2068.63
[41] 2137.75 2225.55 2224.75 2249.68 2265.44 2285.88 2254.79 2267.27
[49] 2254.84 2296.66 2297.76 2302.40 2293.63 2293.29 2267.92 2315.30
[57] 2326.74 2307.12 2379.91 2429.89 2410.12 2395.17 2354.25 2356.74
[65] 2381.35 2398.69 2341.66 2308.76 2239.08 2261.97 2316.16 2321.41
[73] 2303.43 2308.71 2356.09 2345.10 2406.67 2409.07 2433.53 2402.51
[81] 2411.56 2429.81 2421.28 2404.61 2451.76 2466.09 2482.85 2491.40
[89] 2521.60 2513.93
boxplot.stats(dataset$adjHigh)$out
 [1] 2116.500 2078.550 2102.510 2123.547 2105.130 2108.370 2102.030
 [8] 2108.820 2152.680 2133.660 2132.735 2130.530 2091.420 2082.010
[15] 2100.780 2094.880 2071.010 2086.520 2104.370 2088.518 2089.240
[22] 2118.110 2128.810 2078.040 2075.000 2125.700 2090.260 2067.060
[29] 2123.560 2109.780 2075.500 2053.100 2057.990 2072.302 2078.210
[36] 2058.870 2050.990 2058.430 2070.780 2093.327 2142.940 2237.310
[43] 2237.660 2255.000 2284.005 2289.040 2275.320 2277.210 2277.990
[50] 2306.597 2306.440 2318.450 2309.600 2295.320 2303.762 2325.820
[57] 2341.260 2337.450 2452.378 2436.520 2427.140 2419.700 2379.260
[64] 2382.200 2382.710 2416.410 2378.000 2322.000 2285.370 2276.601
[71] 2321.140 2323.340 2343.150 2316.760 2360.340 2369.000 2418.480
[78] 2432.890 2442.944 2440.000 2428.140 2437.971 2442.000 2409.745
[85] 2453.859 2468.000 2494.495 2505.000 2523.260 2526.990
boxplot.stats(dataset$adjLow)$out
 [1] 2018.380 2042.590 2059.330 2072.000 2078.540 2063.090 2077.320
 [8] 2083.130 2104.360 2098.920 2103.710 2097.410 2062.140 2002.020
[15] 2038.130 2021.290 2016.060 2046.100 2071.260 2010.000 2020.270
[22] 2046.415 2021.610 2047.830 2033.370 2072.380 2047.550 2043.510
[29] 2070.000 2054.000 2033.550 2017.680 2026.070 2039.220 2041.555
[36] 2010.730 2014.020 2015.620 2044.030 2056.745 2096.890 2151.620
[43] 2214.800 2225.330 2257.680 2253.714 2238.465 2256.090 2249.190
[50] 2266.000 2284.450 2287.845 2271.710 2258.570 2256.450 2278.210
[57] 2313.840 2304.270 2374.850 2402.280 2402.160 2384.500 2311.700
[64] 2351.410 2342.338 2390.000 2334.730 2283.000 2230.050 2242.720
[71] 2283.320 2295.000 2303.160 2263.520 2321.090 2342.370 2360.110
[78] 2402.990 2412.515 2402.000 2407.690 2404.880 2404.200 2382.830
[85] 2417.770 2441.073 2468.240 2487.330 2494.000 2498.290
boxplot.stats(dataset$adjOpen)$out
 [1] 2073.000 2068.890 2070.000 2105.910 2078.540 2094.210 2099.510
 [8] 2090.250 2104.360 2100.000 2110.390 2119.270 2067.000 2025.010
[15] 2041.830 2067.450 2050.520 2056.520 2076.190 2067.210 2023.370
[22] 2073.120 2101.130 2070.000 2071.760 2074.060 2085.000 2062.300
[29] 2078.990 2076.030 2061.000 2042.050 2041.840 2051.700 2065.370
[36] 2044.810 2038.860 2027.880 2057.630 2059.120 2097.950 2152.940
[43] 2222.500 2226.130 2277.960 2256.700 2266.250 2261.470 2275.160
[50] 2276.980 2303.000 2291.980 2307.890 2285.250 2293.230 2283.470
[57] 2319.930 2336.000 2407.145 2410.330 2404.490 2402.720 2369.740
[64] 2368.420 2350.640 2400.000 2374.890 2291.860 2261.710 2261.090
[71] 2291.830 2309.320 2336.906 2264.400 2328.040 2365.990 2367.000
[78] 2420.000 2412.835 2436.940 2421.960 2422.000 2435.310 2395.020
[85] 2422.520 2451.320 2479.900 2499.500 2494.010 2524.920
boxplot.stats(dataset$adjVolume)$out
 [1] 3402357 4449022 3530169 3841482 4269902 4745183 3654385 3017947
 [9] 2973891 2965771 3246573 3487056 3160585 3270248 3731589 2921393
[17] 3248393 4626086 3095263 5125791 3142760 4758496 3336352 3360727
[25] 3267883 3029471 3369275 4760260 3088305 3318204 4405584 2950120
[33] 4187586 3880723 3212657 4595891 3552194 6207027 5130576 2833483
[41] 4805752 3316905 3055216 3932954 2867053 2978300 3790618 3365365
[49] 4226748 3700125 4252365 3861489 4233435 3651106 3601750 4044137
[57] 3344450 4081528 3573755 3208495 2951309 3793630 3157875 4267698
[65] 3429036 3581072 3107763 3103882 2888827 4330862 3570927 4016353
[73] 4118170 2986439
#boxplots
boxplot(dataset$close)

boxplot(dataset$high)

boxplot(dataset$low)

boxplot(dataset$open)

boxplot(dataset$volume)

boxplot(dataset$adjClose)

boxplot(dataset$adjHigh)

boxplot(dataset$adjLow)

boxplot(dataset$adjOpen)

boxplot(dataset$adjVolume)

This scatter plot helps us determine whether the closing price and volume are correlated. The points show no strong pattern, indicating that the two attributes are at best weakly correlated.

with(dataset, plot(volume, close))
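To back up the visual impression with a number, the linear correlation can be computed directly (a quick check we added, not part of the original pipeline; it uses the same dataset object):

cor(dataset$volume, dataset$close)  # Pearson correlation between volume and close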

The bar plot shows the closing price for each date in the dataset. It indicates how the closing price at the end of each trading day rises and falls over time, with a generally upward trend across the period.

barplot(height = dataset$close, names.arg = dataset$date, xlab = "Date", ylab = "Closing price", main = "date vs Close")

This histogram shows the frequency distribution of the stock's closing price. We observed that most values lie between 1000 and 1200.

hist(dataset$close)

4. Data Preprocessing

Here is our data set before preprocessing

#dataset before preprocessing
print(dataset)

Data cleaning, including handling missing values such as NULLs, is crucial before using data for analysis or modeling. Missing or incorrect data can skew an analysis and lead to inaccurate insights or predictions, while clean data ensures the reliability of the findings and reduces the risk of making decisions based on flawed information.

To find the total number of null values in the dataset, we check each cell: FALSE means the value is not null, TRUE means it is null.

is.na(dataset)
         date close  high   low  open volume adjClose adjHigh
   [1,] FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE   FALSE
   [2,] FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE   FALSE
   [3,] FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE   FALSE
 [ every cell in all 1258 rows (and in the adjLow, adjOpen, and adjVolume columns) is FALSE; long repetitive output truncated ]
sum(is.na(dataset))
[1] 0
print("Since there is no NULL values we don't need to remove any rows")
[1] "Since there is no NULL values we don't need to remove any rows"

Since there are no null values in our data, we do not need to remove any rows.
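For larger datasets, a per-column count is easier to scan than the full logical matrix; a compact alternative carrying the same information (assuming the same dataset object):

colSums(is.na(dataset))        # number of NULL/NA values per column
# dataset <- na.omit(dataset)  # would drop incomplete rows, if any existed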

Since most attributes in our dataset are numeric and removing too many outliers would affect our calculations and predictions, we remove outliers only for the closing price and the volume.
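boxplot.stats flags values lying beyond the whiskers at 1.5 times the interquartile range; written out explicitly, the rule looks roughly like this sketch (the hinges used by boxplot.stats may differ slightly from these quantiles):

q <- quantile(dataset$close, c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr  # lower whisker
upper <- q[2] + 1.5 * iqr  # upper whisker
which(dataset$close < lower | dataset$close > upper)  # indexes of suspected outliers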

#dataset before removing outliers
print(dataset)
summary(dataset)
      date                close             high       
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3  
 1st Qu.:2017-09-12   1st Qu.: 960.8   1st Qu.: 968.8  
 Median :2018-12-11   Median :1132.5   Median :1143.9  
 Mean   :2018-12-12   Mean   :1216.3   Mean   :1227.4  
 3rd Qu.:2020-03-12   3rd Qu.:1360.6   3rd Qu.:1374.3  
 Max.   :2021-06-11   Max.   :2521.6   Max.   :2527.0  
      low              open          volume           adjClose     
 Min.   : 663.3   Min.   : 671   Min.   : 346753   Min.   : 668.3  
 1st Qu.: 952.2   1st Qu.: 959   1st Qu.:1173522   1st Qu.: 960.8  
 Median :1117.9   Median :1131   Median :1412588   Median :1132.5  
 Mean   :1204.2   Mean   :1215   Mean   :1601590   Mean   :1216.3  
 3rd Qu.:1348.6   3rd Qu.:1361   3rd Qu.:1812156   3rd Qu.:1360.6  
 Max.   :2498.3   Max.   :2525   Max.   :6207027   Max.   :2521.6  
    adjHigh           adjLow          adjOpen       adjVolume      
 Min.   : 672.3   Min.   : 663.3   Min.   : 671   Min.   : 346753  
 1st Qu.: 968.8   1st Qu.: 952.2   1st Qu.: 959   1st Qu.:1173522  
 Median :1143.9   Median :1117.9   Median :1131   Median :1412588  
 Mean   :1227.4   Mean   :1204.2   Mean   :1215   Mean   :1601590  
 3rd Qu.:1374.3   3rd Qu.:1348.6   3rd Qu.:1361   3rd Qu.:1812156  
 Max.   :2527.0   Max.   :2498.3   Max.   :2525   Max.   :6207027  
str(dataset)
'data.frame':   1258 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : num  718 719 710 692 694 ...
 $ high     : num  722 723 717 709 702 ...
 $ low      : num  713 717 703 688 693 ...
 $ open     : num  716 719 715 709 699 ...
 $ volume   : int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
 $ adjClose : num  718 719 710 692 694 ...
 $ adjHigh  : num  722 723 717 709 702 ...
 $ adjLow   : num  713 717 703 688 693 ...
 $ adjOpen  : num  716 719 715 709 699 ...
 $ adjVolume: int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
# removing the closing price outliers
outliers <- boxplot(dataset$close, plot=FALSE)$out
dataset <- dataset[-which(dataset$close %in% outliers),]
boxplot.stats(dataset$close)$out
 [1] 1749.13 1763.37 1761.75 1763.00 1752.71 1749.84 1777.02 1781.38
 [9] 1770.15 1746.78 1763.92 1768.88 1771.43 1793.19 1760.74 1798.10
[17] 1827.95 1826.77 1827.99 1819.48 1818.55 1784.13 1775.33 1781.77
[25] 1760.06 1767.77 1763.00 1747.90 1776.09 1758.72 1751.88 1787.25
[33] 1807.21 1766.72 1746.55 1754.40 1790.86 1886.90 1891.25 1901.05
[41] 1899.40 1917.24 1830.79 1863.11 1835.74 1901.35 1927.51
# removing the volume outliers
outliers <- boxplot(dataset$volume, plot=FALSE)$out
dataset <- dataset[-which(dataset$volume %in% outliers),]
boxplot.stats(dataset$volume)$out
 [1] 2641085 2700470 2749221 2607121 2553771 2712222 2634669 2720942
 [9] 2560277 2580374 2558385 2726830 2680400 2619234 2675742 2580612
[17] 2769225 2673464 2576470 2642983 2597455 2561288 2660628 2611373
[25] 2611229 2574061 2664723 2668906 2608568 2610884 2568345 2636142
[33] 2602114 2748292
# dataset after removing outliers
print(dataset)
summary(dataset)
      date                close             high       
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3  
 1st Qu.:2017-08-10   1st Qu.: 942.2   1st Qu.: 943.8  
 Median :2018-09-29   Median :1115.7   Median :1125.6  
 Mean   :2018-10-03   Mean   :1139.9   Mean   :1149.5  
 3rd Qu.:2019-11-14   3rd Qu.:1264.7   3rd Qu.:1275.7  
 Max.   :2021-02-02   Max.   :1927.5   Max.   :1955.8  
      low              open            volume       
 Min.   : 663.3   Min.   : 671.0   Min.   : 346753  
 1st Qu.: 933.8   1st Qu.: 939.7   1st Qu.:1167344  
 Median :1104.2   Median :1115.8   Median :1394116  
 Mean   :1129.1   Mean   :1138.6   Mean   :1480717  
 3rd Qu.:1251.1   3rd Qu.:1262.0   3rd Qu.:1719968  
 Max.   :1914.5   Max.   :1922.6   Max.   :2769225  
    adjClose         adjHigh           adjLow      
 Min.   : 668.3   Min.   : 672.3   Min.   : 663.3  
 1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8  
 Median :1115.7   Median :1125.6   Median :1104.2  
 Mean   :1139.9   Mean   :1149.5   Mean   :1129.1  
 3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1  
 Max.   :1927.5   Max.   :1955.8   Max.   :1914.5  
    adjOpen         adjVolume      
 Min.   : 671.0   Min.   : 346753  
 1st Qu.: 939.7   1st Qu.:1167344  
 Median :1115.8   Median :1394116  
 Mean   :1138.6   Mean   :1480717  
 3rd Qu.:1262.0   3rd Qu.:1719968  
 Max.   :1922.6   Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : num  718 719 710 694 696 ...
 $ high     : num  722 723 717 702 703 ...
 $ low      : num  713 717 703 693 692 ...
 $ open     : num  716 719 715 699 698 ...
 $ volume   : int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Feature selection

Remove Redundant Features

# load the library        
library(mlbench)
Warning: package ‘mlbench’ was built under R version 4.3.2
library(caret)
library(ggplot2)
library(lattice)

# calculate correlation matrix
correlationMatrix <- cor(dataset[,2:11])

# summarize the correlation matrix
print(correlationMatrix)
              close      high       low      open    volume
close     1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
high      0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
low       0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
open      0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
volume    0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
adjClose  1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
adjHigh   0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
adjLow    0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
adjOpen   0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
adjVolume 0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
           adjClose   adjHigh    adjLow   adjOpen adjVolume
close     1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
high      0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
low       0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
open      0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
volume    0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
adjClose  1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
adjHigh   0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
adjLow    0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
adjOpen   0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
adjVolume 0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
# find attributes that are highly correlated (here using a cutoff of 0.5)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes
print(highlyCorrelated)
[1] 7 2 4 9 1 6 8 5
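The printed values are column indexes into the correlation matrix; mapping them back to attribute names (and optionally dropping those columns) could look like this sketch (the dropping line is left commented out because the pipeline below keeps the full dataset):

colnames(correlationMatrix)[highlyCorrelated]  # names of the highly correlated attributes
# dataset_reduced <- dataset[, !(names(dataset) %in% colnames(correlationMatrix)[highlyCorrelated])]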

dataset before normalization

#dataset before normalization 
print(dataset)
summary(dataset)
      date                close             high       
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3  
 1st Qu.:2017-08-10   1st Qu.: 942.2   1st Qu.: 943.8  
 Median :2018-09-29   Median :1115.7   Median :1125.6  
 Mean   :2018-10-03   Mean   :1139.9   Mean   :1149.5  
 3rd Qu.:2019-11-14   3rd Qu.:1264.7   3rd Qu.:1275.7  
 Max.   :2021-02-02   Max.   :1927.5   Max.   :1955.8  
      low              open            volume       
 Min.   : 663.3   Min.   : 671.0   Min.   : 346753  
 1st Qu.: 933.8   1st Qu.: 939.7   1st Qu.:1167344  
 Median :1104.2   Median :1115.8   Median :1394116  
 Mean   :1129.1   Mean   :1138.6   Mean   :1480717  
 3rd Qu.:1251.1   3rd Qu.:1262.0   3rd Qu.:1719968  
 Max.   :1914.5   Max.   :1922.6   Max.   :2769225  
    adjClose         adjHigh           adjLow      
 Min.   : 668.3   Min.   : 672.3   Min.   : 663.3  
 1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8  
 Median :1115.7   Median :1125.6   Median :1104.2  
 Mean   :1139.9   Mean   :1149.5   Mean   :1129.1  
 3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1  
 Max.   :1927.5   Max.   :1955.8   Max.   :1914.5  
    adjOpen         adjVolume      
 Min.   : 671.0   Min.   : 346753  
 1st Qu.: 939.7   1st Qu.:1167344  
 Median :1115.8   Median :1394116  
 Mean   :1138.6   Mean   :1480717  
 3rd Qu.:1262.0   3rd Qu.:1719968  
 Max.   :1922.6   Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : num  718 719 710 694 696 ...
 $ high     : num  722 723 717 702 703 ...
 $ low      : num  713 717 703 693 692 ...
 $ open     : num  716 719 715 699 698 ...
 $ volume   : int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Normalization was performed to ensure consistent scaling of the data. The technique applied was min-max normalization, which rescales the values of the selected attributes into the range between 0 and 1.

The normalized dataset provides a more uniform and comparable representation of the attributes, enabling more accurate analysis and modeling for stock prediction, with the result shown below.

normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataWithoutNormalization <- dataset
dataset$close<-normalize(dataWithoutNormalization$close)
dataset$volume<-normalize(dataWithoutNormalization$volume)
dataset$open<-normalize(dataWithoutNormalization$open)
dataset$low <-normalize(dataWithoutNormalization$low)
dataset$high <-normalize(dataWithoutNormalization$high)
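The same function could also be applied to every numeric column in one step; a compact alternative to the five assignments above (left commented out because the adjusted columns are kept on their original scale here):

# num_cols <- sapply(dataWithoutNormalization, is.numeric)
# dataset[num_cols] <- lapply(dataWithoutNormalization[num_cols], normalize)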

dataset after normalization

#dataset after normalization 
print(dataset)
summary(dataset)
      date                close             high       
 Min.   :2016-06-14   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   1st Qu.:0.2175   1st Qu.:0.2115  
 Median :2018-09-29   Median :0.3553   Median :0.3532  
 Mean   :2018-10-03   Mean   :0.3746   Mean   :0.3718  
 3rd Qu.:2019-11-14   3rd Qu.:0.4736   3rd Qu.:0.4701  
 Max.   :2021-02-02   Max.   :1.0000   Max.   :1.0000  
      low              open            volume      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.2162   1st Qu.:0.2147   1st Qu.:0.3387  
 Median :0.3524   Median :0.3554   Median :0.4324  
 Mean   :0.3723   Mean   :0.3736   Mean   :0.4681  
 3rd Qu.:0.4698   3rd Qu.:0.4722   3rd Qu.:0.5669  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    adjClose         adjHigh           adjLow      
 Min.   : 668.3   Min.   : 672.3   Min.   : 663.3  
 1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8  
 Median :1115.7   Median :1125.6   Median :1104.2  
 Mean   :1139.9   Mean   :1149.5   Mean   :1129.1  
 3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1  
 Max.   :1927.5   Max.   :1955.8   Max.   :1914.5  
    adjOpen         adjVolume      
 Min.   : 671.0   Min.   : 346753  
 1st Qu.: 939.7   1st Qu.:1167344  
 Median :1115.8   Median :1394116  
 Mean   :1138.6   Mean   :1480717  
 3rd Qu.:1262.0   3rd Qu.:1719968  
 Max.   :1922.6   Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : num  0.0397 0.0402 0.0334 0.0202 0.022 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

dataset before Discretization

#dataset before Discretization 
print(dataset)
summary(dataset)
      date                close             high       
 Min.   :2016-06-14   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   1st Qu.:0.2175   1st Qu.:0.2115  
 Median :2018-09-29   Median :0.3553   Median :0.3532  
 Mean   :2018-10-03   Mean   :0.3746   Mean   :0.3718  
 3rd Qu.:2019-11-14   3rd Qu.:0.4736   3rd Qu.:0.4701  
 Max.   :2021-02-02   Max.   :1.0000   Max.   :1.0000  
      low              open            volume      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.2162   1st Qu.:0.2147   1st Qu.:0.3387  
 Median :0.3524   Median :0.3554   Median :0.4324  
 Mean   :0.3723   Mean   :0.3736   Mean   :0.4681  
 3rd Qu.:0.4698   3rd Qu.:0.4722   3rd Qu.:0.5669  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    adjClose         adjHigh           adjLow      
 Min.   : 668.3   Min.   : 672.3   Min.   : 663.3  
 1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8  
 Median :1115.7   Median :1125.6   Median :1104.2  
 Mean   :1139.9   Mean   :1149.5   Mean   :1129.1  
 3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1  
 Max.   :1927.5   Max.   :1955.8   Max.   :1914.5  
    adjOpen         adjVolume      
 Min.   : 671.0   Min.   : 346753  
 1st Qu.: 939.7   1st Qu.:1167344  
 Median :1115.8   Median :1394116  
 Mean   :1138.6   Mean   :1480717  
 3rd Qu.:1262.0   3rd Qu.:1719968  
 Max.   :1922.6   Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : num  0.0397 0.0402 0.0334 0.0202 0.022 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

We used discretization on our class label "close" to simplify it, since it takes a large range of continuous values; we grouped the values into intervals to make them easier to analyze.

We chose the value 0.2957251 as the cut point, taken as the mean of the normalized closing price.

dataset$close <- ifelse(dataset$close <= 0.2957251 , "low","High")
print(dataset)

We discretized the closing price into two categories (low, high) based on the mean: low means the value is less than the mean of close, and high means it is equal to or greater than the mean.
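An equivalent way to express the same two-bin discretization is cut(), which generalizes directly to more intervals (a sketch only; not run here because close has already been converted above):

# dataset$close <- cut(dataset$close, breaks = c(-Inf, 0.2957251, Inf), labels = c("low", "high"))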

Encoding. We encoded the close categories as factor levels, which helps the model handle this attribute easily.


dataset$close <- factor(dataset$close,levels = c("low", "High"), labels = c("1", "2"))

print(dataset)

dataset after Discretization

#dataset after Discretization 
print(dataset)
summary(dataset)
      date            close        high             low        
 Min.   :2016-06-14   1:396   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   2:700   1st Qu.:0.2115   1st Qu.:0.2162  
 Median :2018-09-29           Median :0.3532   Median :0.3524  
 Mean   :2018-10-03           Mean   :0.3718   Mean   :0.3723  
 3rd Qu.:2019-11-14           3rd Qu.:0.4701   3rd Qu.:0.4698  
 Max.   :2021-02-02           Max.   :1.0000   Max.   :1.0000  
      open            volume          adjClose     
 Min.   :0.0000   Min.   :0.0000   Min.   : 668.3  
 1st Qu.:0.2147   1st Qu.:0.3387   1st Qu.: 942.2  
 Median :0.3554   Median :0.4324   Median :1115.7  
 Mean   :0.3736   Mean   :0.4681   Mean   :1139.9  
 3rd Qu.:0.4722   3rd Qu.:0.5669   3rd Qu.:1264.7  
 Max.   :1.0000   Max.   :1.0000   Max.   :1927.5  
    adjHigh           adjLow          adjOpen      
 Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Summary after preprocessing: several steps were taken to refine, clean, and prepare the data for analysis and modeling. These preprocessing steps enhance the quality and reliability of the data, supporting more accurate stock price prediction.

dataset after preprocessing

#dataset after preprocessing 
print(dataset)
summary(dataset)
      date            close        high             low        
 Min.   :2016-06-14   1:396   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   2:700   1st Qu.:0.2115   1st Qu.:0.2162  
 Median :2018-09-29           Median :0.3532   Median :0.3524  
 Mean   :2018-10-03           Mean   :0.3718   Mean   :0.3723  
 3rd Qu.:2019-11-14           3rd Qu.:0.4701   3rd Qu.:0.4698  
 Max.   :2021-02-02           Max.   :1.0000   Max.   :1.0000  
      open            volume          adjClose     
 Min.   :0.0000   Min.   :0.0000   Min.   : 668.3  
 1st Qu.:0.2147   1st Qu.:0.3387   1st Qu.: 942.2  
 Median :0.3554   Median :0.4324   Median :1115.7  
 Mean   :0.3736   Mean   :0.4681   Mean   :1139.9  
 3rd Qu.:0.4722   3rd Qu.:0.5669   3rd Qu.:1264.7  
 Max.   :1.0000   Max.   :1.0000   Max.   :1927.5  
    adjHigh           adjLow          adjOpen      
 Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" ...
 $ close    : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Feature selection

Feature selection is the process of selecting a subset of relevant features (attributes) from the original feature set of a dataset. The goal is to keep the most relevant and important features, thereby reducing dimensionality and improving model performance.

# Feature selection using Recursive Feature Elimination (RFE)

library(mlbench)
library(caret)

# define the control using a random-forest selection function;
# method="cv" with number=11 requests 11-fold cross-validation
control <- rfeControl(functions=rfFuncs, method="cv", number=11)
# run the RFE algorithm: predictors in columns 1-10, outcome in column 11
results <- rfe(dataset[,1:10], dataset[,11], sizes=c(1:10), rfeControl=control)

summarize the results

print(results)

Recursive feature selection

Outer resampling method: Cross-Validated (11 fold) 

Resampling performance over subset size:

The top 1 variables (out of 1):
   volume

list the chosen features

predictors(results)
[1] "volume"

plot the results

plot(results, type=c("h", "o"))

5. Data Mining Techniques

We applied both supervised and unsupervised learning techniques to our dataset (Google stock prediction), covering classification and clustering methods. For classification we used the train-test split partitioning method, which divides the dataset into two subsets at different ratios, and we implemented three algorithms to build nine different decision trees.

6. Evaluation and Comparison

We will choose the attributes with the highest importance (from feature selection) to create a tree:

  1. Dividing the dataset:

We divided our dataset into two subsets for each split ratio:

The first split is 70-30, i.e., Training (70%) and Testing (30%):

# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(70%) and Testing(30%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c( 0.70, 0.30))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
  2. Determine the predictor attributes and the class label attribute (the formula):
library(party)    
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
#myFormula 
myFormula <- close ~volume+open+high+low
  3. Build a decision tree using information gain:

Information gain is a concept used in the field of machine learning and decision tree algorithms. It is a measure of the effectiveness of a particular attribute in classifying data. In the context of decision trees, information gain helps determine the order in which attributes are chosen for splitting the data.
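As a concrete illustration of the measure (our own helper functions, not the internals of the tree library), information gain for a candidate split is the entropy of the class label minus the weighted entropy of the two branches; the threshold below is the root split reported by the tree output that follows:

# entropy of a class label
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]           # drop empty levels to avoid 0 * log2(0)
  -sum(p * log2(p))
}
# information gain of a binary split given as a logical vector
info_gain <- function(y, split) {
  w <- mean(split)
  entropy(y) - w * entropy(y[split]) - (1 - w) * entropy(y[!split])
}
info_gain(trainData$close, trainData$open <= 0.2974608)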

dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
   
      1   2
  1 284  11
  2   0 476
# 4. Print and plot the tree:

print(dataset_ctree)

     Conditional inference tree with 4 terminal nodes

Response:  close 
Inputs:  volume, open, high, low 
Number of observations:  771 

1) open <= 0.2974608; criterion = 1, statistic = 423.273
  2) high <= 0.2892353; criterion = 1, statistic = 19.817
    3)*  weights = 267 
  2) high > 0.2892353
    4)*  weights = 17 
1) open > 0.2974608
  5) low <= 0.2955676; criterion = 0.995, statistic = 10.36
    6)*  weights = 11 
  5) low > 0.2955676
    7)*  weights = 476 
plot(dataset_ctree, type="simple")

# 5. Use the constructed model to predict the class labels of the test data:
testPred <- predict(dataset_ctree, newdata = testData)
result <- table(testPred, testData$close)
result
        
testPred   1   2
       1 111   3
       2   1 210
# Evaluate the model and create confusion matrix
library(e1071)
Warning: package ‘e1071’ was built under R version 4.3.2
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 111   3
       2   1 210
                                          
               Accuracy : 0.9877          
                 95% CI : (0.9688, 0.9966)
    No Information Rate : 0.6554          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9729          
                                          
 Mcnemar's Test P-Value : 0.6171          
                                          
            Sensitivity : 0.9911          
            Specificity : 0.9859          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 0.9953          
             Prevalence : 0.3446          
         Detection Rate : 0.3415          
   Detection Prevalence : 0.3508          
      Balanced Accuracy : 0.9885          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9910714
specificity(as.table(co_result))
[1] 0.9859155
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9876923 
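Precision and sensitivity (recall) can also be combined into the F1 score, a single balanced measure (our addition, computed from the caret helpers already used above):

prec <- precision(as.table(co_result))
rec  <- sensitivity(as.table(co_result))
2 * prec * rec / (prec + rec)  # F1 score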
  6. Building the tree using the Gini Index (CART)

The Gini Index is another criterion used in decision tree algorithms, particularly in the context of the Classification and Regression Trees (CART) algorithm. Like information gain, the Gini Index is used to evaluate the impurity or homogeneity of a dataset.

The Gini Index for a specific attribute measures the probability of incorrectly classifying a randomly chosen element in the dataset. A lower Gini Index indicates a purer or more homogeneous set. In the context of decision trees, the attribute with the lowest Gini Index is chosen as the split attribute.
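As a small worked illustration (our helper, not the rpart internals), the Gini impurity of a set is 1 minus the sum of squared class proportions:

gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}
gini(trainData$close)  # impurity of the class label before any split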

# For decision tree model
library(rpart)
# For data visualization
library(rpart.plot)
Warning: package ‘rpart.plot’ was built under R version 4.3.2
dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))

Visualizing the unpruned tree

library(rpart.plot)
rpart.plot(dataset.cart)

Checking the order of variable importance

dataset.cart$variable.importance
       low       high       open     volume 
343.117705 330.102896 326.553402   4.732658 
pred.tree <- predict(dataset.cart, testData, type = "class")

table(pred.tree, testData$close)
         
pred.tree   1   2
        1 111   3
        2   1 210
# 5. Use the constructed model to predict the class labels of the test data
# (note: this reuses dataset_ctree, the conditional inference tree built earlier;
#  its confusion table happens to match the CART table above)
testPred <- predict(dataset_ctree, newdata = testData)
result <- table(testPred, testData$close)
result
        
testPred   1   2
       1 111   3
       2   1 210
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 111   3
       2   1 210
                                          
               Accuracy : 0.9877          
                 95% CI : (0.9688, 0.9966)
    No Information Rate : 0.6554          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9729          
                                          
 Mcnemar's Test P-Value : 0.6171          
                                          
            Sensitivity : 0.9911          
            Specificity : 0.9859          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 0.9953          
             Prevalence : 0.3446          
         Detection Rate : 0.3415          
   Detection Prevalence : 0.3508          
      Balanced Accuracy : 0.9885          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9910714
specificity(as.table(co_result))
[1] 0.9859155
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9876923 
  7. Building the tree using the Gain Ratio (C5.0)

The Gain Ratio is used to select the attribute that maximizes the Information Gain while avoiding the bias towards attributes with many values. It provides a more balanced measure for attribute selection in decision tree construction.

While Information Gain simply measures the reduction in entropy or uncertainty, Gain Ratio takes into account the intrinsic information of an attribute. It aims to penalize attributes that may have a large number of values, potentially leading to overfitting.
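Reusing the entropy and info_gain helpers sketched in the information gain step, the gain ratio divides the information gain by the split information of the partition (our illustration; the threshold is the root split from the C5.0 output below, and a non-degenerate split is assumed):

# split information of a binary partition given as a logical vector
split_info <- function(split) {
  w <- mean(split)          # assumes 0 < w < 1
  -sum(c(w, 1 - w) * log2(c(w, 1 - w)))
}
gain_ratio <- function(y, split) info_gain(y, split) / split_info(split)
gain_ratio(trainData$close, trainData$low <= 0.2960392)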

install.packages("C50")
install.packages("printr")
library(C50)
Warning: package ‘C50’ was built under R version 4.3.2
library(printr)
Warning: package ‘printr’ was built under R version 4.3.2Registered S3 method overwritten by 'printr':
  method                from     
  knit_print.data.frame rmarkdown
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)

Call:
C5.0.formula(formula = myFormula, data = trainData)


C5.0 [Release 2.07 GPL Edition]     Fri Dec  1 06:15:18 2023
-------------------------------

Class specified by attribute `outcome'

Read 771 cases (5 attributes) from undefined.data

Decision tree:

low > 0.2960392: 2 (481/1)
low <= 0.2960392:
:...high <= 0.2892354: 1 (266)
    high > 0.2892354:
    :...high > 0.3075281: 2 (2)
        high <= 0.3075281:
        :...open <= 0.278852: 2 (2)
            open > 0.278852: 1 (20/3)


Evaluation on training data (771 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         5    4( 0.5%)   <<


       (a)   (b)    <-classified as
      ----  ----
       283     1    (a): class 1
         3   484    (b): class 2


    Attribute usage:

    100.00% low
     37.61% high
      2.85% open


Time: 0.0 secs
plot(CloseTree)

The second split is 60-40, which means Training (60%) and Testing (40%):

# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(60%) and Testing(40%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.60 , 0.40))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
2. Determine the predictor attributes and the class label attribute (the formula):
library(party)    
# class label ~ predictor attributes
myFormula <- close ~ volume + open + high + low
3. Build a decision tree using the training set and check the prediction:
dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
   
      1   2
  1 249   8
  2   0 398
# 4.Print and plot the tree:

print(dataset_ctree)

     Conditional inference tree with 4 terminal nodes

Response:  close 
Inputs:  volume, open, high, low 
Number of observations:  655 

1) open <= 0.2974608; criterion = 1, statistic = 363.998
  2) high <= 0.2892353; criterion = 0.998, statistic = 11.719
    3)*  weights = 235 
  2) high > 0.2892353
    4)*  weights = 12 
1) open > 0.2974608
  5) low <= 0.2955676; criterion = 0.987, statistic = 8.71
    6)*  weights = 10 
  5) low > 0.2955676
    7)*  weights = 398 
plot(dataset_ctree, type="simple")

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 146   6
       2   1 288
# Evaluate the model and create a confusion matrix
# install.packages("caret")
# install.packages("e1071", dependencies = TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 146   6
       2   1 288
                                          
               Accuracy : 0.9841          
                 95% CI : (0.9676, 0.9936)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9646          
                                          
 Mcnemar's Test P-Value : 0.1306          
                                          
            Sensitivity : 0.9932          
            Specificity : 0.9796          
         Pos Pred Value : 0.9605          
         Neg Pred Value : 0.9965          
             Prevalence : 0.3333          
         Detection Rate : 0.3311          
   Detection Prevalence : 0.3447          
      Balanced Accuracy : 0.9864          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9795918
precision(as.table(co_result))
[1] 0.9605263
acc <- co_result$overall["Accuracy"]
acc
Accuracy 
0.984127 
2. Building the Tree using Gini Index (CART)
# For decision tree model
install.packages("rpart")
Error in install.packages : Updating loaded packages
library(rpart)
# For data visualization
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))

Visualizing the unpruned tree

rpart.plot(dataset.cart)

Checking the order of variable importance

dataset.cart$variable.importance
       low       high       open     volume 
294.972422 284.520643 282.198025   4.645235 
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
         
pred.tree   1   2
        1 146   4
        2   1 290
# 5. Use the constructed model to predict the class labels of test data
# (note: this call reuses the earlier ctree model; to evaluate the CART tree
#  itself, use the pred.tree predictions from above instead)
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 146   6
       2   1 288
# Evaluate the model and create a confusion matrix
# install.packages("caret")
# install.packages("e1071", dependencies = TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 146   6
       2   1 288
                                          
               Accuracy : 0.9841          
                 95% CI : (0.9676, 0.9936)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9646          
                                          
 Mcnemar's Test P-Value : 0.1306          
                                          
            Sensitivity : 0.9932          
            Specificity : 0.9796          
         Pos Pred Value : 0.9605          
         Neg Pred Value : 0.9965          
             Prevalence : 0.3333          
         Detection Rate : 0.3311          
   Detection Prevalence : 0.3447          
      Balanced Accuracy : 0.9864          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9795918
precision(as.table(co_result))
[1] 0.9605263
acc <- co_result$overall["Accuracy"]
acc
Accuracy 
0.984127 
3. Building the Tree using Gain Ratio (C5.0)
install.packages("caret")
Error in install.packages : Updating loaded packages
install.packages("C50")
Error in install.packages : Updating loaded packages
install.packages("printr")
Error in install.packages : Updating loaded packages
library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)

Call:
C5.0.formula(formula = myFormula, data = trainData)


C5.0 [Release 2.07 GPL Edition]     Fri Dec  1 06:15:19 2023
-------------------------------

Class specified by attribute `outcome'

Read 655 cases (5 attributes) from undefined.data

Decision tree:

low <= 0.2960392: 1 (254/6)
low > 0.2960392: 2 (401/1)


Evaluation on training data (655 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         2    7( 1.1%)   <<


       (a)   (b)    <-classified as
      ----  ----
       248     1    (a): class 1
         6   400    (b): class 2


    Attribute usage:

    100.00% low


Time: 0.0 secs
plot(CloseTree)

The third split is 80-20, which means Training (80%) and Testing (20%):

# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(80%) and Testing(20%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.80 , 0.20))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]

2. Determine the predictor attributes and the class label attribute (the formula):

library(party)    
# class label ~ predictor attributes
myFormula <- close ~ volume + open + high + low

3. Build a decision tree using the training set and check the prediction:

dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
   
      1   2
  1 322  14
  2   0 535
# 4.Print and plot the tree:

print(dataset_ctree)

     Conditional inference tree with 4 terminal nodes

Response:  close 
Inputs:  volume, open, high, low 
Number of observations:  871 

1) open <= 0.2974608; criterion = 1, statistic = 478.791
  2) high <= 0.2892353; criterion = 1, statistic = 22.684
    3)*  weights = 303 
  2) high > 0.2892353
    4)*  weights = 19 
1) open > 0.2974608
  5) low <= 0.2997876; criterion = 0.997, statistic = 11.651
    6)*  weights = 14 
  5) low > 0.2997876
    7)*  weights = 535 
plot(dataset_ctree, type="simple")

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1  74   2
       2   0 149
# Evaluate the model and create a confusion matrix
# install.packages("caret")
# install.packages("e1071", dependencies = TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1  74   2
       2   0 149
                                          
               Accuracy : 0.9911          
                 95% CI : (0.9683, 0.9989)
    No Information Rate : 0.6711          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.98            
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.9868          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 1.0000          
             Prevalence : 0.3289          
         Detection Rate : 0.3289          
   Detection Prevalence : 0.3378          
      Balanced Accuracy : 0.9934          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 1
specificity(as.table(co_result))
[1] 0.986755
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9911111 
2. Building the Tree using Gini Index (CART)
# For decision tree model
install.packages("rpart")
Error in install.packages : Updating loaded packages
library(rpart)
# For data visualization
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))

Visualizing the unpruned tree

library(rpart.plot)
rpart.plot(dataset.cart)

Checking the order of variable importance

dataset.cart$variable.importance
       low       high       open     volume 
386.324609 371.012963 368.657326   4.711276 
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
         
pred.tree   1   2
        1  74   2
        2   0 149
# 5. Use the constructed model to predict the class labels of test data
# (note: this call reuses the earlier ctree model; to evaluate the CART tree
#  itself, use the pred.tree predictions from above instead)
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1  74   2
       2   0 149
# Evaluate the model and create a confusion matrix
# install.packages("caret")
# install.packages("e1071", dependencies = TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1  74   2
       2   0 149
                                          
               Accuracy : 0.9911          
                 95% CI : (0.9683, 0.9989)
    No Information Rate : 0.6711          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.98            
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.9868          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 1.0000          
             Prevalence : 0.3289          
         Detection Rate : 0.3289          
   Detection Prevalence : 0.3378          
      Balanced Accuracy : 0.9934          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 1
specificity(as.table(co_result))
[1] 0.986755
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9911111 
3. Building the Tree using Gain Ratio (C5.0)
install.packages("caret")
Error in install.packages : Updating loaded packages
install.packages("C50")
Error in install.packages : Updating loaded packages
install.packages("printr")
Error in install.packages : Updating loaded packages
library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)

Call:
C5.0.formula(formula = myFormula, data = trainData)


C5.0 [Release 2.07 GPL Edition]     Fri Dec  1 06:15:21 2023
-------------------------------

Class specified by attribute `outcome'

Read 871 cases (5 attributes) from undefined.data

Decision tree:

low > 0.2960392:
:...open > 0.3106603: 2 (518)
:   open <= 0.3106603:
:   :...high <= 0.2916803: 1 (2)
:       high > 0.2916803: 2 (23)
low <= 0.2960392:
:...high <= 0.2892354: 1 (302)
    high > 0.2892354:
    :...open <= 0.278852: 2 (2)
        open > 0.278852:
        :...high <= 0.3075281: 1 (22/4)
            high > 0.3075281: 2 (2)


Evaluation on training data (871 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         7    4( 0.5%)   <<


       (a)   (b)    <-classified as
      ----  ----
       322          (a): class 1
         4   545    (b): class 2


    Attribute usage:

    100.00% low
     65.33% open
     40.53% high


Time: 0.0 secs
plot(CloseTree)

After running all three methods, we noticed that for Information Gain and the Gini Index (CART):

the Training (70%) and Testing (30%) split has sensitivity = 0.9959016, specificity = 0.9685039, accuracy = 0.9865229;

the Training (60%) and Testing (40%) split has sensitivity = 0.9969512, specificity = 0.9710983, accuracy = 0.988024;

the Training (80%) and Testing (20%) split has sensitivity = 0.9940476, specificity = 0.9655172, accuracy = 0.9843137.

This means the best split for our dataset is Training (60%) and Testing (40%), because it has the highest sensitivity (0.9969512, i.e. 99.7%), specificity (0.9710983, i.e. 97.1%), and accuracy (0.988024, i.e. 98.8%).
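A small sketch that tabulates the three splits (metric values copied from the runs above) and picks the best one by accuracy:

# Compare the three train/test splits (values copied from the results above)
split_metrics <- data.frame(
  split       = c("70/30", "60/40", "80/20"),
  sensitivity = c(0.9959016, 0.9969512, 0.9940476),
  specificity = c(0.9685039, 0.9710983, 0.9655172),
  accuracy    = c(0.9865229, 0.9880240, 0.9843137)
)
split_metrics[which.max(split_metrics$accuracy), ]  # -> the 60/40 split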

Clustering is unsupervised learning: it does not use a class label when building the clusters. To implement the clusters we used the k-means algorithm, which produces K clusters, each represented by the center point of the cluster. It assigns each object to the nearest cluster, then iteratively recalculates the centers and reassigns the objects until the center point of each cluster no longer changes, which means each object has settled in the right cluster.
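To make that loop concrete, here is a minimal from-scratch sketch of k-means in R. It is illustrative only (the analysis below uses the built-in kmeans()), and it assumes a numeric matrix X and that no cluster becomes empty:

simple_kmeans <- function(X, k, max_iter = 100) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # k random initial centers
  for (i in seq_len(max_iter)) {
    # assign each object to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # recompute each center as the mean of its assigned objects
    new_centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # stop once the centers no longer change
    if (max(abs(new_centers - centers)) < 1e-9) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}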

The factoextra package helps implement the clustering technique. The scale() method is used for scaling and centering the dataset's objects; kmeans() finds a specified number of clusters; fviz_cluster() visualizes the cluster diagram; silhouette() calculates the average silhouette for each cluster and fviz_silhouette() visualizes it; and fviz_nbclust() compares three different numbers of clusters to find the optimal one, evaluating how well the clusters are separated and how compact they are. In both techniques we used set.seed() with the same random number each time we tried a different size, to ensure that we get the same result each time.
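The whole workflow, condensed into one sketch (each step is run individually in the sections below; the numeric-column filter is an assumption about the preprocessed dataset):

library(factoextra)
library(cluster)
X <- scale(dataset[, sapply(dataset, is.numeric)])   # numeric columns only
set.seed(8953)
km <- kmeans(X, 3)                                   # k-means with k = 3
fviz_cluster(km, data = X)                           # cluster diagram
sil <- silhouette(km$cluster, dist(X))               # per-object silhouettes
fviz_silhouette(sil)                                 # average width per cluster
fviz_nbclust(X, kmeans, method = "silhouette")       # compare candidate k values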

Data types should be transformed into numeric types before clustering.

# preprocessing
# Data types should be transformed into numeric types before clustering;
# scale() needs a numeric matrix, so keep only the numeric columns
# (dropping the date column and the discretized close label)
dataset <- scale(dataset[, sapply(dataset, is.numeric)])
# k-means clustering to find 4 clusters 
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 4)

visualization of 4 clusters

# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)

average silhouette width for each cluster

#average silhouette width for each cluster 
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset)) 
# silhouette() takes the cluster vector and a dissimilarity object (class dist)
fviz_silhouette(avg_sil)

total within-cluster sum of squares and BCubed precision and recall

# Total sum of squares
kmeans.result$tot.withinss
[1] 1900.127
#BCubed metric: the average of per-item precision & recall
library(DPBBM)
c <- kmeans.result$cluster
# BCubed_metric() needs both the true categories and the cluster assignments
# (two vectors of equal length), e.g. BCubed_metric(true_labels, c, 0.5);
# passing only the cluster vector raises a length-mismatch error
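For reference, BCubed precision and recall can also be computed from scratch; a sketch, assuming a vector of true categories cat and cluster assignments cl of equal length:

bcubed <- function(cat, cl) {
  n <- length(cat)
  prec <- rec <- numeric(n)
  for (i in seq_len(n)) {
    same_cluster <- which(cl == cl[i])           # items sharing i's cluster
    same_cat     <- which(cat == cat[i])         # items sharing i's category
    correct      <- length(intersect(same_cluster, same_cat))
    prec[i] <- correct / length(same_cluster)
    rec[i]  <- correct / length(same_cat)
  }
  c(precision = mean(prec), recall = mean(rec))  # averaged over all items
}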

print the clustering result

# print the clustering result
print(kmeans.result)

Apply k-means clustering with k = 3

# run k-means clustering to find 3 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 3)

visualization of 3 clusters

# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)

average silhouette width for each cluster

#average silhouette width for each cluster 
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset)) 
# silhouette() takes the cluster vector and a dissimilarity object (class dist)
fviz_silhouette(avg_sil)

total within-cluster sum of squares and BCubed precision and recall

# Total sum of squares
kmeans.result$tot.withinss
[1] 2908.955
#BCubed metric: the average of per-item precision & recall
library(DPBBM)
c <- kmeans.result$cluster
# BCubed_metric() needs both the true categories and the cluster assignments
# (two vectors of equal length), e.g. BCubed_metric(true_labels, c, 0.6)

print the clustering result

# print the clustering result
print(kmeans.result)

Apply k-means clustering with k = 2

# run k-means clustering to find 2 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 2)

visualization of 2 clusters

# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)

average silhouette width for each cluster

#average silhouette width for each cluster 
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset)) 
# silhouette() takes the cluster vector and a dissimilarity object (class dist)
fviz_silhouette(avg_sil)

total within-cluster sum of squares and BCubed precision and recall

# Total sum of squares
kmeans.result$tot.withinss
[1] 4126
#BCubed metric: the average of per-item precision & recall
library(DPBBM)
c <- kmeans.result$cluster
# BCubed_metric() needs both the true categories and the cluster assignments
# (two vectors of equal length), e.g. BCubed_metric(true_labels, c, 0.6)

print the clustering result

# print the clustering result
print(kmeans.result)

kmeansruns() calls kmeans() to perform k-means clustering. It initializes the k-means algorithm several times with random points from the data set as means, and it estimates the number of clusters by the Calinski-Harabasz index or the average silhouette width.

install.packages("fpc")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/fpc_2.2-10.zip'
Content type 'application/zip' length 839705 bytes (820 KB)
downloaded 820 KB
package ‘fpc’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\shade\AppData\Local\Temp\RtmpktBzoJ\downloaded_packages
library(fpc)
Warning: package ‘fpc’ was built under R version 4.3.2
#kmeansruns() : It calls  kmeans() to perform  k-means clustering
#It initializes the k-means algorithm several times with random points from the data set as means.
#It estimates the number of clusters by index or average silhouette width
kmeansruns.result <- kmeansruns(dataset)  
kmeansruns.result
K-means clustering with 4 clusters of sizes 290, 216, 408, 182

Cluster means:
         open     volume    adjClose     adjHigh      adjLow
1 -1.10632007 -0.4407852 -1.10386041 -1.10816456 -1.10089108
2  1.58007963  0.2687019  1.58272008  1.58333842  1.58085838
3  0.06915076 -0.5133559  0.06973682  0.06332081  0.07637218
4 -0.26746094  1.5342709 -0.27582770 -0.25532015 -0.29322444
      adjOpen  adjVolume
1 -1.10632007 -0.4407852
2  1.58007963  0.2687019
3  0.06915076 -0.5133559
4 -0.26746094  1.5342709

Clustering vector:
   (per-row cluster assignments omitted here for brevity; R itself stopped
    printing at getOption("max.print"), omitting the final 96 entries)

Within cluster sum of squares by cluster:
[1] 384.3763 663.7649 449.0264 402.9592
 (between_SS / total_SS =  75.2 %)

Available components:

 [1] "cluster"      "centers"      "totss"        "withinss"    
 [5] "tot.withinss" "betweenss"    "size"         "iter"        
 [9] "ifault"       "crit"         "bestk"       
fviz_cluster(kmeansruns.result, data = dataset)

k-medoids clustering with PAM (Partitioning Around Medoids): each cluster is represented by an actual data object (its medoid) rather than a mean, which makes it more robust to outliers than k-means.

#install.packages("cluster")
library(cluster)
# group into 4 clusters
pam.result <- pam(dataset, 4)
plot(pam.result)

Hierarchical clustering: draw a sample of 40 records from the dataset so that the clustering plot will not be overcrowded.

##----Hierarchical Clustering of the Data-----##
set.seed(2835)
# draw a sample of 40 records from the dataset data, so that the clustering plot will not be over crowded
idx <- sample(1:dim(dataset)[1], 40)
dataset2 <- dataset[idx, ]
## hierarchical clustering
library(factoextra) 
hc.cut <- hcut(dataset2, k = 2, hc_method = "complete") # Computes Hierarchical Clustering and Cut the Tree
# Visualize dendrogram
fviz_dend(hc.cut,rect = TRUE)  #logical value specifying whether to add a rectangle around groups.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.

# Visualize cluster
fviz_cluster(hc.cut, ellipse.type = "convex") # Character specifying frame type. Possible values are 'convex', 'confidence' etc

define function to compute average silhouette for k clusters using silhouette()

silhouette_score <- function(k){ 
  km <- kmeans(USArrests, centers = k,nstart=25) # if centers is a number, how many random sets should be chosen?
  ss <- silhouette(km$cluster, dist(USArrests))
  sil<- mean(ss[, 3])
  return(sil)
}
# k cluster range from 2 to 10
k <- 2:10
## call the function for each k value
avg_sil <- sapply(k, silhouette_score)  # apply a function over a list or vector
plot(k, avg_sil, type='b', xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)

silhouette method

#install.packages("NbClust")
library(NbClust)
#a)fviz_nbclust() with silhouette method using library(factoextra) 
fviz_nbclust(dataset, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")
#b) NbClust validation
fres.nbclust <- NbClust(dataset, distance="euclidean", min.nc = 2, max.nc = 10, method="kmeans", index="all")
# Elbow method for determining the optimal number of clusters (k-means)
wss <- numeric(length = 10)
for (k in 1:10) {
  kmeans_model <- kmeans(dataset, centers = k, nstart = 10)
  wss[k] <- kmeans_model$tot.withinss  # total within-cluster sum of squares
}
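The elbow is then read off a plot of WSS against k (a sketch using base-R plotting):

# Plot WSS against the number of clusters and look for the "elbow"
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")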

After trying the three values of k, and based on the plots above, we noticed that the best size is k = 2; it partitions the data better than the other values.

# Extract the total within-cluster sum of squares (TWSS)
twss <- sum(kmeans.result$withinss)
# Print the TWSS
cat(paste("Total Within-Cluster Sum of Squares (TWSS):", twss, "\n"))
# Evaluate BCubed-style precision and recall for k-medoids
# Load required libraries
library(caret)

# Illustrative toy vectors only: placeholders standing in for true labels
# and predicted cluster labels, not the actual k-medoids assignments
true_labels <- c(1, 1, 1, 0, 0, 1, 0, 1, 0, 1)
predicted_labels <- c(1, 0, 1, 0, 0, 1, 0, 1, 1, 1)

# Create a confusion matrix
conf_matrix <- confusionMatrix(factor(predicted_labels), factor(true_labels))

# Extract recall from the confusion matrix
recall <- conf_matrix$byClass["Sensitivity"]

# Print the result
cat(paste("Recall:", recall, "\n"))

7. Findings

Our dataset represents the opening and closing prices of Google stock in the market. Our goal was to predict higher closing prices that indicate a positive trend in Google stock. To get the best, most accurate, and most precise results, we used several data mining preprocessing techniques that improve the quality of the data. Several plotting methods were applied to help us understand our data. Based on the plots we removed outliers; we did not find any null or missing values. Then data transformation was applied to transform attribute values, such as normalization and discretization.

Then we applied the data mining tasks, classification and clustering. For classification, we used the decision tree method to construct our model; three different sizes of training and testing data were used to get the best results for construction and evaluation, with the results for the different sizes shown above.

In conclusion, the most accurate model and the best split for our dataset is Training (60%) and Testing (40%), because it has the highest sensitivity (0.9969512, i.e. 99.7%), specificity (0.9710983, i.e. 97.1%), and accuracy (0.988024, i.e. 98.8%).

For clustering, three different values of K were used in the K-means algorithm to find the optimal number of clusters; the average silhouette width for each K was calculated to reach the results shown above.

Since the highest average silhouette width occurs when the number of clusters equals 2, that is the optimal number of clusters. The higher the average silhouette width, the closer the objects within the same cluster are to each other, and the farther they are from the objects in the other clusters.
In the end, both models are helpful and helped us in predicting. But since our dataset is numeric, after performing both clustering and classification we noticed that clustering fits the dataset better, because its concept is built around numeric data.

8. References

---
title: "R Notebook"
output: html_notebook
editor_options: 
  markdown: 
    wrap: 72
---

# **Closing price of Google Stock Prediction**

#  1.  Problem

Predicting the closing price of a stock is a complex problem because of several challenges.
Stock prices are influenced by a multitude of factors such as market trends, Analyzing and incorporating all these factors accurately into a predictive model is a complex task.
Market volatility makes predicting stock prices accurately challenging.
Data Quality and Quantity, the pursuit of solving this problem is crucial because accurate stock price predictions have significant implications for investors, financial institutions, and businesses. Accurate predictions can aid investors in making informed decisions.
The importance of predicting stock prices lies in its implications for investors, financial institutions, and businesses, it can potentially help investors make more informed decisions about buying, selling, or holding stocks, aiding in risk.


#  2.  Data mining Task

In our project, we will use two data mining tasks to help us predict the closing price of a stock. two of the methods you can consider are classification and clustering.
For classification, we will train our model to be able to classify the close price based on a set of attributes such as volume, open, high, low, length etc. For clustering, we will partition closing prices into subnets or clusters, where they are similar to prices in cluster but dissimilar to prices in other clusters based on the attributes Low, Heigh, Open, volume, adjClose, adjHigh.


#  3.  Data

Our dataset is from the source:
<https://www.kaggle.com/datasets/shreenidhihipparagi/google-stock-prediction>

Number of Attributes: 14

Number of objects: 1258

Attribute characteristics:

+------------+---------+-----------------------------------------------+
| Attribute  | Data    | Description                                   |
| Name       | Type    |                                               |
+------------+---------+-----------------------------------------------+
| symbol     | unique  | Name of company                               |
|            | value   |                                               |
+------------+---------+-----------------------------------------------+
| date       | numeric | date: day, month, and year.                   |
+------------+---------+-----------------------------------------------+
| close      | numeric | closing price of a stock is the final price   |
|            |         | at which a stock is traded on a given trading |
|            |         | day.                                          |
+------------+---------+-----------------------------------------------+
| high       | numeric | The highest price at which a stock traded     |
|            |         | during a specific trading day.                |
+------------+---------+-----------------------------------------------+
| low        | numeric | The lowest price at which a stock traded      |
|            |         | during a specific trading day.                |
+------------+---------+-----------------------------------------------+
| open       | numeric | The price of a stock at the beginning of a    |
|            |         | trading day. It's the price at which the      |
|            |         | first trade occurred on that day.             |
+------------+---------+-----------------------------------------------+
| Volume     | numeric | The total number of shares traded during a    |
|            |         | trading day. Volume is a measure of market    |
|            |         | activity and liquidity for a stock            |
+------------+---------+-----------------------------------------------+
| adjClose   |         | The closing price of a stock adjusted for any |
|            | numeric | corporate actions like dividends, stock       |
|            |         | splits, or other events that could affect the |
|            |         | stock price.                                  |
+------------+---------+-----------------------------------------------+
| adjHigh    | numeric | The highest price of a stock during a trading |
|            |         | day, adjusted for any corporate actions       |
+------------+---------+-----------------------------------------------+
| adjLow     | numeric | The lowest price of a stock during a trading  |
|            |         | day, adjusted for any corporate actions.      |
+------------+---------+-----------------------------------------------+
| adjOpen    | numeric | The opening price of a stock at the beginning |
|            |         | of a trading day, adjusted for any corporate  |
|            |         | actions.                                      |
+------------+---------+-----------------------------------------------+
| adjVolume  | numeric | The trading volume of a stock adjusted for    |
|            |         | any corporate actions. This can provide a     |
|            |         | clearer picture of tranding activity.         |
+------------+---------+-----------------------------------------------+
| divCash    | Binary  | The amount of money paid by a company to its  |
|            |         | shareholders as a portion of its profits.     |
|            |         | Dividends are typically paid on a per-share   |
|            |         | basis                                         |
+------------+---------+-----------------------------------------------+
| s          | Binary  | If a stock undergoes a stock split, the split |
| plitFactor |         | factor indicates the ratio by which the       |
|            |         | shares were split. For instance, a 2-for-1    |
|            |         | split means that for every old share, you now |
|            |         | have 2 new shares.                            |
+------------+---------+-----------------------------------------------+

```{r}
# Load necessary packages
if (!require(caret)) {
  install.packages("caret")
}
if (!require(cluster)) {
  install.packages("cluster")
}
if (!require(fpc)) {
  install.packages("fpc")
}
if (!require(ggplot2)) {
  install.packages("ggplot2")
}
library(caret)
library(cluster)
library(fpc)
library(ggplot2)
```


-   **Sample of row**

```{r}
dataset = read.csv('Google.csv') 
```

```{r}
View(dataset)
print(dataset)
```


we removed the attributes (symbol, divCash, splitFactor) as they have
one value only so we do not need them

```{r}
dataset=dataset[,2:12]
```


*Convert the date column to a date format*

```{r}
dataset$date <- as.Date(dataset$date, format = "%Y-%m-%d %H:%M:%S")
```
 
 
```{r}
print(dataset)
str(dataset)
```


-   **Statiscal summarise**

```{r}
summary(dataset)
```


mean of closing price Using the mean closing price can serve as a basic
reference point or a simple benchmark for forecasting future stock
prices. The mean closing price is the average price at which a stock has
closed over a specific period.

```{r}
mean(dataset$close)
```


**variance Code**

The concept of variance in the context of closing prices for stock
prediction serves to quantify the spread or dispersion of the closing
prices around their mean or average value. It provides a measure of how
much the actual closing prices deviate from the average closing price
over a specific period.

```{r}
var(dataset$close)
```


-   **Statiscal summarise**

Summaries for all numeric attributes and their outliers and boxplots.

```{r}
#stastistical measures
#summaries
summary(dataset$close)
summary(dataset$high)
summary(dataset$low)
summary(dataset$open)
summary(dataset$volume)
summary(dataset$adjClose)
summary(dataset$adjHigh)
summary(dataset$adjLow)
summary(dataset$adjOpen)
summary(dataset$adjVolume)
```


-   **Outliers**

```{r}
#outliers
boxplot.stats(dataset$close)$out
boxplot.stats(dataset$high)$out
boxplot.stats(dataset$low)$out
boxplot.stats(dataset$open)$out
boxplot.stats(dataset$volume)$out
boxplot.stats(dataset$adjClose)$out
boxplot.stats(dataset$adjHigh)$out
boxplot.stats(dataset$adjLow)$out
boxplot.stats(dataset$adjOpen)$out
boxplot.stats(dataset$adjVolume)$out
```


-   **Boxplots**

```{r}
#boxplots
boxplot(dataset$close)
boxplot(dataset$high)
boxplot(dataset$low)
boxplot(dataset$open)
boxplot(dataset$volume)
boxplot(dataset$adjClose)
boxplot(dataset$adjHigh)
boxplot(dataset$adjLow)
boxplot(dataset$adjOpen)
boxplot(dataset$adjVolume)
```


- 	**Plotting methods**


-   **Scatter Plot**

This scatter plot helps us to determine whether the closing price and
volume are correlated to each other or not, it shows that the two
attributes are corelated and have proportional relationship.

```{r}
with(dataset, plot(volume, close))
```


-   **Barplot**

The Bar plot represents the closing price and date in dataset. It indicates that closing prices at the end of a traded day are increasing or decreasing depending on the date.

```{r}
barplot(height = dataset$close, names.arg = dataset$date, xlab = "Date", ylab = "Closing price", main = "date vs Close")
```


-   **Histogram**

This Histogram represents the frequency of a stock closing price in the dataset. After observation, we noticed that the most values lie in between 1000 to 1200.

```{r}
hist(dataset$close)
```


#  4.  Data preprocessing


-   **Raw dataset**

Here is our data set before preprocessing

```{r}
#dataset before preprocessing
print(dataset)
```


-    **Checking for missing values**

Data cleaning, including handling missing values like NULLs, is crucial before utilizing data for analysis or modeling. It’s important to get the best quality of analysis. Such as accuracy where missing or incorrect data can skew analysis, leading to inaccurate insights or predictions. And clean data ensures the reliability of your findings, reducing the risk of making decisions based on flawed information.

*to find the total null values in the dataset #Checking NULL, FALSE
means no null, TRUE cells means the value of the cell is null*

```{r}
is.na(dataset)
sum(is.na(dataset))

print("Since there is no NULL values we don't need to remove any rows")
```

In our data since there are no Null values, we don’t need to remove any rows.


-    **Detecting and removing the outliers**

Since most attributes in our dataset are numeric and removing outliers will affect our calculations and prediction, we will remove closing price and volumes outliers only.

```{r}
#dataset before removing outliers
print(dataset)
summary(dataset)
str(dataset)

#removing close outlier
outliers <- boxplot(dataset$close, plot=FALSE)$out
dataset <- dataset[-which(dataset$close %in% outliers),]
boxplot.stats(dataset$close)$out

#removing volume's outlier
outliers <- boxplot(dataset$volume, plot=FALSE)$out
dataset <- dataset[-which(dataset$volume %in% outliers),]
boxplot.stats(dataset$volume)$out

#data set after removing outliers
print(dataset)
summary(dataset)
str(dataset)
```



-   **Data transformation**

**Feature selection**

Remove Redundant Features

```{r}
# load the library        
library(mlbench)
library(caret)
library(ggplot2)
library(lattice)

# calculate correlation matrix
correlationMatrix <- cor(dataset[,2:11])

# summarize the correlation matrix
print(correlationMatrix)

# find attributes that are highly corrected (ideally >0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5 )

# print indexes of highly correlated attributes
print(highlyCorrelated)
```


-   **Normalization**


dataset before normalization 

```{r}
#dataset before normalization 
print(dataset)
summary(dataset)
str(dataset)
```


normalization was performed to ensure consistent scaling of the data.
The normalization technique applied was the max-min normalization. This
technique rescales the values of specific attributes within a defined
range between 0 and 1.

We can use the normalized dataset provides a more uniform and comparable
representation of the attributes, enabling accurate analysis and
modeling for stock predaction with result as shown.

```{r}
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataWithoutNormalization <- dataset
dataset$close<-normalize(dataWithoutNormalization$close)
dataset$volume<-normalize(dataWithoutNormalization$volume)
dataset$open<-normalize(dataWithoutNormalization$open)
dataset$low <-normalize(dataWithoutNormalization$low)
dataset$high <-normalize(dataWithoutNormalization$high)
```


dataset after normalization

```{r}
#dataset after normalization 
print(dataset)
summary(dataset)
str(dataset)
```


-   **Discretization**


dataset before Discretization 

```{r}
#dataset before Discretization 
print(dataset)
summary(dataset)
str(dataset)
```


we used the Discretization technique on our class label "close" to
simplify it as it has a large continuous values, we made them fall into
intervals, to make it easier to analyze

and we chose the value 0.2957251 as it the mean value for the closing

```{r}
dataset$close <- ifelse(dataset$close <= 0.2957251 , "low","High")
print(dataset)
```

we discretized it into two categories (low, high) based on the maen, low
meaning it is less than the mean of the close , and high meaning it is
equal to or higher than the mean.


Encoding 
We encoded close data into factors, which would help the model read this data easily

```{r}

dataset$close <- factor(dataset$close,levels = c("low", "High"), labels = c("1", "2"))

print(dataset)
```


dataset after Discretization

```{r}
#dataset after Discretization 
print(dataset)
summary(dataset)
str(dataset)
```


summary after preprocessing after preprocessing the data for stock
price prediction, several steps are taken to refine, clean, and prepare
the data for analysis and modeling. These preprocessing steps aim to
enhance the quality and reliability of the data for more accurate stock
price prediction.

dataset after preprocessing

```{r}
#dataset after preprocessing 
print(dataset)
summary(dataset)
str(dataset)
```


**Feature selection**

Feature selection is a process of selecting a subset of relevant
features (or attributes) from the original set of features in a dataset.
The goal of feature selection is to choose the most relevant and
important features, thereby reducing dimensionality, and improving model
performance.

#Feature selection ,Feature selection using Recursive Feature
Elimination or RFE

```{r}
    library(mlbench)
library(caret)

# define the control using a random forest selection function 
# number=12 means the length of the list
control <- rfeControl(functions=rfFuncs, method="cv", number=11)
# run the RFE algorithm from column 1 to 11  
results <- rfe(dataset[,1:10],dataset[,11], sizes=c(1:10), rfeControl=control)
```

summarize the results

```{r}
print(results)
```

list the chosen features

```{r}
predictors(results)
```

plot the results

```{r}
plot(results, type=c("h", "o"))
```


#  5.   Data Mining Techniques

We did both supervised and unsupervised learning techniques on our
dataset (Google stock prediction), which involves classification and
clustering methods, for classification we did a partitioning method
called the train-test split, which splits the dataset into two subsets
of different ratios, and we implemented three algorithms to form 9
different decision trees.



#  6.   Evaluation and Comparison


-   **Classification**

We will choose the attributes with the highest importance (from feature
selection) to create a tree:


1. Dividing the dataset:

we divided our dataset into two divisions for each split:

first one 70-30, which means Training(70%) and Testing(30%):

```{r}
# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(70%) and Testing(30%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c( 0.70, 0.30))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
```


2. Determine the predictor attributes and the class label attribute.( the formula):

```{r}
library(party)    
#myFormula 
myFormula <- close ~volume+open+high+low

```


1. Build a decision tree using Information gain:

Information gain is a concept used in the field of machine learning and
decision tree algorithms. It is a measure of the effectiveness of a
particular attribute in classifying data. In the context of decision
trees, information gain helps determine the order in which attributes
are chosen for splitting the data.

```{r}
dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
# 4.Print and plot the tree:

print(dataset_ctree)
plot(dataset_ctree, type="simple")
```

```{r}
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
```

```{r}
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
sensitivity(as.table(co_result))
specificity(as.table(co_result))
precision(as.table(co_result))

acc <- co_result$overall["Accuracy"]
acc
```


2. Building the Tree using Gini Index(CART)

The Gini Index is another criterion used in decision tree algorithms,
particularly in the context of the Classification and Regression Trees
(CART) algorithm. Like information gain, the Gini Index is used to
evaluate the impurity or homogeneity of a dataset.

The Gini Index for a specific attribute measures the probability of
incorrectly classifying a randomly chosen element in the dataset. A
lower Gini Index indicates a purer or more homogeneous set. In the
context of decision trees, the attribute with the lowest Gini Index is
chosen as the split attribute.

```{r}
# For decision tree model
install.packages("rpart")
library(rpart)
# For data visualization
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))
```


Visualizing the unpruned tree

```{r}
library(rpart.plot)
rpart.plot(dataset.cart)
```


Checking the order of variable importance

```{r}
dataset.cart$variable.importance
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
```

```{r}
# 5. Use the constructed CART model to predict the class labels of test data:
testPred <- predict(dataset.cart, newdata = testData, type = "class")
result <- table(testPred, testData$close)
result
```

```{r}
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
sensitivity(as.table(co_result))
specificity(as.table(co_result))
precision(as.table(co_result))

acc <- co_result$overall["Accuracy"]
acc
```


3. Building the tree using Gain Ratio (C5.0)

The Gain Ratio is used to select the attribute that maximizes the
Information Gain while avoiding the bias towards attributes with many
values. It provides a more balanced measure for attribute selection in
decision tree construction.

While Information Gain simply measures the reduction in entropy or
uncertainty, Gain Ratio takes into account the intrinsic information of
an attribute. It aims to penalize attributes that may have a large
number of values, potentially leading to overfitting.
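Reusing the entropy and information-gain helpers sketched earlier, an illustrative gain-ratio computation shows the penalty on many-valued attributes:

```{r}
# Illustrative only: gain ratio = information gain / split information
split_info <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}
gain_ratio <- function(y, x) info_gain(y, x) / split_info(x)

y <- c("up", "up", "down", "down")
gain_ratio(y, c("a", "a", "b", "b"))  # gain 1 bit / split info 1 bit  = 1
gain_ratio(y, c("a", "b", "c", "d"))  # gain 1 bit / split info 2 bits = 0.5
```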

```{r}
install.packages("caret")
install.packages("C50")
install.packages("printr")

library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)
plot(CloseTree)
```


The second split is 60-40, i.e. Training (60%) and Testing (40%):

```{r}
# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(60%) and Testing(40%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.60 , 0.40))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
```


2. Determine the predictor attributes and the class label attribute (the formula):

```{r}
library(party)    
#myFormula 
myFormula <- close ~ volume + open + high + low
```


3. Build a decision tree using the training set and check the prediction:

```{r}
dataset_ctree <- ctree(myFormula, data = trainData)
# tabulate training predictions against actual labels
table(predict(dataset_ctree), trainData$close)

# 4. Print and plot the tree:
print(dataset_ctree)
plot(dataset_ctree, type = "simple")
```

```{r}
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
```

```{r}
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
sensitivity(as.table(co_result))
specificity(as.table(co_result))
precision(as.table(co_result))

acc <- co_result$overall["Accuracy"]
acc
```


2.  Building the tree using the Gini Index (CART)

```{r}
# For the decision tree model
if (!require(rpart)) install.packages("rpart")
library(rpart)
# For tree visualization
if (!require(rpart.plot)) install.packages("rpart.plot")
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))
```


Visualizing the unpruned tree

```{r}
rpart.plot(dataset.cart)
```


Checking the order of variable importance

```{r}
dataset.cart$variable.importance
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
```

```{r}
# 5. Use the constructed CART model to predict the class labels of test data:
testPred <- predict(dataset.cart, newdata = testData, type = "class")
result <- table(testPred, testData$close)
result
```

```{r}
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
sensitivity(as.table(co_result))
specificity(as.table(co_result))
precision(as.table(co_result))

acc <- co_result$overall["Accuracy"]
acc
```


3.  Building the tree using Gain Ratio (C5.0)

```{r}
install.packages("caret")
install.packages("C50")
install.packages("printr")

library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)
plot(CloseTree)
```


The third split is 80-20, i.e. Training (80%) and Testing (20%):

```{r}
# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(80%) and Testing(20%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.80 , 0.20))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
```


2. Determine the predictor attributes and the class label attribute (the formula):

```{r}
library(party)    
#myFormula 
myFormula <- close ~ volume + open + high + low

```


3. Build a decision tree using the training set and check the prediction:

```{r}
dataset_ctree <- ctree(myFormula, data = trainData)
# tabulate training predictions against actual labels
table(predict(dataset_ctree), trainData$close)

# 4. Print and plot the tree:
print(dataset_ctree)
plot(dataset_ctree, type = "simple")
```

```{r}
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
```

```{r}
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
sensitivity(as.table(co_result))
specificity(as.table(co_result))
precision(as.table(co_result))

acc <- co_result$overall["Accuracy"]
acc
```


2.  Building the tree using the Gini Index (CART)

```{r}
# For the decision tree model
if (!require(rpart)) install.packages("rpart")
library(rpart)
# For tree visualization
if (!require(rpart.plot)) install.packages("rpart.plot")
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))
```


Visualizing the unpruned tree

```{r}
library(rpart.plot)
rpart.plot(dataset.cart)
```


Checking the order of variable importance

```{r}
dataset.cart$variable.importance
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
```

```{r}
# 5. Use the constructed CART model to predict the class labels of test data:
testPred <- predict(dataset.cart, newdata = testData, type = "class")
result <- table(testPred, testData$close)
result
```

```{r}
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
sensitivity(as.table(co_result))
specificity(as.table(co_result))
precision(as.table(co_result))

acc <- co_result$overall["Accuracy"]
acc
```


3.  Building the tree using Gain Ratio (C5.0)

```{r}
install.packages("caret")
install.packages("C50")
install.packages("printr")

library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)
plot(CloseTree)
```


After running all three methods, we noticed that for Information Gain and
the Gini Index (CART):

- Training (70%) / Testing (30%): sensitivity = 0.9959016, specificity = 0.9685039, accuracy = 0.9865229
- Training (60%) / Testing (40%): sensitivity = 0.9969512, specificity = 0.9710983, accuracy = 0.988024
- Training (80%) / Testing (20%): sensitivity = 0.9940476, specificity = 0.9655172, accuracy = 0.9843137

This means the best split for our dataset is *Training (60%) and
Testing (40%)*, because it has the highest sensitivity = 0.9969512
(99.7%), specificity = 0.9710983 (97.1%), and accuracy = 0.988024 (98.8%).
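
For reference, the metrics compared above are computed from the confusion-matrix counts (TP, TN, FP, FN):

$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$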




-   **Clustering**

Clustering is unsupervised learning: it does not use a class label when building the clusters. To build the clusters we used the k-means algorithm, which produces K clusters, each represented by its center point. It assigns each object to the nearest cluster, then iteratively recomputes the centers and reassigns the objects until the center of each cluster no longer changes, which means each object has settled in the right cluster.
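
To make the loop concrete, here is a minimal, illustrative sketch of the k-means iteration just described (the analysis below uses R's built-in kmeans(); this toy version assumes no cluster ever becomes empty):

```{r}
# Minimal k-means sketch: assign to nearest center, recompute centers, repeat
kmeans_sketch <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centers
  for (i in seq_len(max_iter)) {
    # distance from every object to every center, via the full distance matrix
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)  # index of the nearest center
    # recompute each center as the mean of its assigned objects
    # (assumes every cluster keeps at least one object)
    new_centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    if (all(abs(new_centers - centers) < 1e-8)) break  # centers stable: stop
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}
```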

The factoextra package is used to help implement the clustering technique. scale() centers and scales the attributes of the dataset; kmeans() finds a specified number of clusters; fviz_cluster() visualizes the cluster diagram; silhouette() calculates the average silhouette width for each cluster and fviz_silhouette() visualizes it; and fviz_nbclust() compares three different numbers of clusters to find the optimal one, evaluating how well separated and how compact the clusters are. In both techniques we called set.seed() with the same random number each time we tried a different size, to ensure we get the same result each time.


Data types should be transformed into numeric types before clustering.

```{r}
# Preprocessing: data types should already be numeric; scale() centers and scales them
dataset <- scale(dataset)
View(dataset)
```


- Apply k-means clustering with K = 4

```{r}
# k-means clustering to find 4 clusters 
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 4)
```

visualization of 4 clusters

```{r}
# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)
```

average silhouette width for each cluster

```{r}
# average silhouette width for each cluster
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset))
# silhouette() needs a dissimilarity object (class dist) or a matrix
fviz_silhouette(avg_sil)
```

total within-cluster sum of squares and BCubed precision and recall

```{r}
# Total within-cluster sum of squares
kmeans.result$tot.withinss

# BCubed averages per-object precision and recall. DPBBM's BCubed_metric(L, C, B)
# expects reference labels L, cluster labels C, and a weight B; with no
# ground-truth labels here, comparing the clustering with itself is only a sanity check.
library(DPBBM)
c <- kmeans.result$cluster
BCubed_metric(c, c, 0.50)
```

print the clustering result
```{r}
# print the clustering result
print(kmeans.result)
```


Apply k-means clustering with K = 3

```{r}
# run k-means clustering to find 3 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 3)
```

visualization of 3 clusters

```{r}
# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)
```

average silhouette width for each cluster
```{r}
# average silhouette width for each cluster
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset))
# silhouette() needs a dissimilarity object (class dist) or a matrix
fviz_silhouette(avg_sil)
```

total within-cluster sum of squares and BCubed precision and recall

```{r}
# Total within-cluster sum of squares
kmeans.result$tot.withinss

# BCubed averages per-object precision and recall. DPBBM's BCubed_metric(L, C, B)
# expects reference labels L, cluster labels C, and a weight B; with no
# ground-truth labels here, comparing the clustering with itself is only a sanity check.
library(DPBBM)
c <- kmeans.result$cluster
BCubed_metric(c, c, 0.6)
```

print the clustering result
```{r}
# print the clustering result
print(kmeans.result)
```


Apply k-means clustering with K = 2

```{r}
# run k-means clustering to find 2 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 2)
```


visualization of 2 clusters

```{r}
# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)
```

average silhouette width for each cluster
```{r}
# average silhouette width for each cluster
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset))
# silhouette() needs a dissimilarity object (class dist) or a matrix
fviz_silhouette(avg_sil)
```

total within-cluster sum of squares and BCubed precision and recall

```{r}
# Total within-cluster sum of squares
kmeans.result$tot.withinss

# BCubed averages per-object precision and recall. DPBBM's BCubed_metric(L, C, B)
# expects reference labels L, cluster labels C, and a weight B; with no
# ground-truth labels here, comparing the clustering with itself is only a sanity check.
library(DPBBM)
c <- kmeans.result$cluster
BCubed_metric(c, c, 0.6)
```

print the clustering result
```{r}
# print the clustering result
print(kmeans.result)
```



kmeansruns() calls kmeans() to perform k-means clustering. It initializes the algorithm several times with random points from the data set as means, and it estimates the number of clusters by a criterion such as the average silhouette width.

```{r}
install.packages("fpc")
library(fpc)
#kmeansruns() : It calls  kmeans() to perform  k-means clustering
#It initializes the k-means algorithm several times with random points from the data set as means.
#It estimates the number of clusters by index or average silhouette width
kmeansruns.result <- kmeansruns(dataset)  
kmeansruns.result
fviz_cluster(kmeansruns.result, data = dataset)
```


k-medoids clustering with PAM

```{r}
#install.packages("cluster")
library(cluster)
# group into 4 clusters
pam.result <- pam(dataset, 4)
plot(pam.result)
```


Hierarchical Clustering
We draw a sample of 40 records from the dataset so that the clustering plot will not be overcrowded.

```{r}
## ---- Hierarchical Clustering of the Data ---- ##
set.seed(2835)
# draw a sample of 40 records so that the dendrogram is not overcrowded
idx <- sample(1:dim(dataset)[1], 40)
dataset2 <- dataset[idx, ]
## hierarchical clustering
library(factoextra)
hc.cut <- hcut(dataset2, k = 2, hc_method = "complete") # compute hierarchical clustering and cut the tree

```


```{r}
# Visualize dendrogram
fviz_dend(hc.cut,rect = TRUE)  #logical value specifying whether to add a rectangle around groups.
# Visualize cluster
fviz_cluster(hc.cut, ellipse.type = "convex") # Character specifying frame type. Possible values are 'convex', 'confidence' etc

```



define function to compute average silhouette for k clusters using
silhouette()

```{r}
silhouette_score <- function(k){
  # nstart: number of random initial center sets to try
  km <- kmeans(dataset, centers = k, nstart = 25)
  ss <- silhouette(km$cluster, dist(dataset))
  mean(ss[, 3])  # average silhouette width over all objects
}
```

-  Optimal number of clusters:

```{r}
# k ranges from 2 to 10 clusters
k <- 2:10
# apply the silhouette_score function over every k
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type = 'b', xlab = 'Number of clusters', ylab = 'Average Silhouette Score', frame = FALSE)

```


silhouette method

```{r}
#install.packages("NbClust")
library(NbClust)
#a)fviz_nbclust() with silhouette method using library(factoextra) 
fviz_nbclust(dataset, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")
```

```{r}
#b) NbClust validation
fres.nbclust <- NbClust(dataset, distance="euclidean", min.nc = 2, max.nc = 10, method="kmeans", index="all")
```

```{r}
# Elbow method for determining the optimal number of clusters (k-means)
wss <- numeric(10)
for (k in 1:10) {
  kmeans_model <- kmeans(dataset, centers = k, nstart = 10)
  wss[k] <- kmeans_model$tot.withinss
}
# plot total within-cluster sum of squares against k and look for the "elbow"
plot(1:10, wss, type = "b", xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
```


After trying the three values of K, and based on the plots above, we
noticed that the best size is K = 2; it partitions the data better than
the other values.

```{r}
# Extract the total within-cluster sum of squares (TWSS)
twss <- sum(kmeans.result$withinss)
# Print the TWSS
cat(paste("Total Within-Cluster Sum of Squares (TWSS):", twss, "\n"))
```

```{r}
# Evaluate BCubed-style precision and recall for k-medoids
# Load required libraries
library(caret)
library(ggplot2)
library(lattice)

# Placeholder labels for illustration only: this clustering has no ground-truth
# labels, so the vectors below are not derived from our dataset
true_labels <- c(1, 1, 1, 0, 0, 1, 0, 1, 0, 1)
predicted_labels <- c(1, 0, 1, 0, 0, 1, 0, 1, 1, 1)

# Create a confusion matrix
conf_matrix <- confusionMatrix(factor(predicted_labels), factor(true_labels))

# Extract recall from the confusion matrix
recall <- conf_matrix$byClass["Sensitivity"]

# Print the result
cat(paste("Recall:", recall, "\n"))
```


#  7.  Findings

Our dataset represents the opening and closing prices of Google stock in the market. Our goal was to predict higher closing prices that indicate a positive trend in Google stock.
To get the best, most accurate, and most precise results, we used several data mining preprocessing techniques that improve the quality of the data. Several plotting methods were applied to help us understand our data. Based on the plots we removed outliers; we did not find any null or missing values. Then data transformation was applied to transform attribute values, such as normalization and discretization.

Then we applied the data mining tasks, which are classification and clustering.
For classification, we used the decision tree method to construct our model; 3 different sizes of training and testing data were used to get the best result for construction and evaluation.
The results for the different sizes follow:

- 70% Training and 30% Testing data
- Information Gain:
  Accuracy = 0.9865229
  precision = 
  sensitivity = 0.9959016 
  specificity = 0.9685039 
  
- Gain Ratio (C5.0):
  Accuracy = 
  precision = 
  sensitivity = 
  specificity = 
  
- Gini Index (CART):
  Accuracy = 
  precision = 
  sensitivity = 
  specificity = 
  
- 60% Training and 40% Testing data
- Information Gain:
  Accuracy = 0.988024
  precision = 
  sensitivity = 0.9969512 
  specificity = 0.9710983 
  
- Gain Ratio (C5.0):
  Accuracy = 
  precision = 
  sensitivity = 
  specificity = 
  
- Gini Index (CART):
  Accuracy = 
  precision = 
  sensitivity = 
  specificity = 
  
- 80% Training and 20% Testing data
- Information Gain:
  Accuracy = 0.9843137
  precision = 
  sensitivity = 0.9940476 
  specificity = 0.9655172 
  
- Gain Ratio (C5.0):
  Accuracy = 
  precision = 
  sensitivity = 
  specificity = 
  
- Gini Index (CART):
  Accuracy = 
  precision = 
  sensitivity = 
  specificity = 


In conclusion, the most accurate model and the best split for our dataset is Training (60%) and Testing (40%), because it has the highest sensitivity = 0.9969512 (99.7%), specificity = 0.9710983 (97.1%), and accuracy = 0.988024 (98.8%).

For clustering, 3 different values of K were used in the k-means algorithm to find the optimal number of clusters. The average silhouette width for each K was calculated to reach the results shown below:

- Number of clusters (K) = 4:
  average silhouette width = 0.43
  sum of squares = 1900.127
  BCubed precision = 
  BCubed recall = 
  
- Number of clusters (K) = 3:
  average silhouette width = 0.37
  sum of squares = 2908.955
  BCubed precision = 
  BCubed recall = 
  
- Number of clusters (K) = 2:
  average silhouette width = 0.45
  sum of squares = 4126
  BCubed precision = 
  BCubed recall = 

Since the highest average silhouette width occurs when the number of clusters equals 2, that is the optimal number of clusters. The higher the average silhouette width, the closer the objects within the same cluster are to each other, and the farther they are from the objects in the other clusters.
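
Concretely, the silhouette of an object $i$ compares its mean distance $a(i)$ to the other objects of its own cluster with its mean distance $b(i)$ to the objects of the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1,
$$

and the average silhouette width is the mean of $s(i)$ over all objects, so values near 1 indicate compact, well-separated clusters.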
In the end, both models are helpful and aided us in prediction, but since our dataset is numeric, after performing both clustering and classification we noticed that clustering fits the dataset better, because its whole concept is built around numeric data.

#  8.  References



